Abstract — The success of sparse representations in image 
modeling has motivated their use in computer vision applications. 
In complex visual recognition tasks it is typical to adopt multiple 
descriptors, that describe different aspects of the data, for 
obtaining improved recognition performance. Descriptors that 
have diverse forms can be fused into a unified feature space in a 
principled manner using kernel methods. Learning sparse models 
in the resulting space will provide highly discriminative sparse 
codes for object recognition and unsupervised clustering. To this 
end, we develop the paradigm of multiple kernel sparse coding 
and propose two different approaches to optimize dictionaries for 
the feature space representations. The first approach works by 
building a separate dictionary for each descriptor set in its own 
feature space and then optimizes them for efficient representation 
in the unified feature space. In contrast, the second approach 
learns dictionaries in the unified feature space directly using the 
ensemble kernel matrices and hence provides greater flexibility 
in the choice of kernel functions. Finally, we evaluate the utility 
of multiple kernel sparse codes obtained with the proposed 
approaches in object recognition and clustering applications. We 
demonstrate that improvements in performance are obtained 
by fusing multiple descriptors, when compared to using each 
descriptor individually. 

Index Terms — Sparse coding, dictionary learning, multiple 
kernel learning, object recognition, clustering. 

I. Introduction 

A. Sparsity in Image Modeling 

IMAGE understanding has been playing an increasingly 
crucial role in vision applications. Sparse models form an 
important component in image understanding, since the statistics 
of natural images reveal the presence of sparse structure. 
In [1], Olshausen and Field demonstrated that learning sparse 
linear codes for natural images results in a family of features 
similar to those found in the primary visual cortex. Sparsity 
has been exploited in a variety of signal and image processing 
applications including compression [2], denoising [3], compressed 
sensing [4], source separation [5], face classification 
[6], and object recognition [7]. By representing data as a sparse 
linear combination of atoms from a "dictionary" matrix, sparse 
methods lead to parsimonious models, in addition to being 
efficient for large scale learning. The generative model for 
representing a data sample $\mathbf{y} \in \mathbb{R}^M$ using the sparse code 
$\mathbf{a} \in \mathbb{R}^K$ can be written as 

$$\mathbf{y} = \mathbf{\Psi}\mathbf{a} + \mathbf{n}, \tag{1}$$

where $\mathbf{\Psi}$ is the dictionary of size $M \times K$ and $\mathbf{n}$ is the noise 
component not represented using the sparse code. The goal 
is to solve for the code $\mathbf{a}$, also known as the coefficient 
vector, such that it is sparse, i.e. only a few of its entries differ 
significantly from zero. Cost functions that are commonly used 
to measure sparsity include the $\ell_0$ and the $\ell_1$ norms, which 
respectively measure the number of non-zero coefficients, and 
the sum of absolute values of the coefficients. The problem of 
sparse coding can be expressed as 



$$\min_{\mathbf{a}} \|\mathbf{y} - \mathbf{\Psi}\mathbf{a}\|_2^2 + \lambda \|\mathbf{a}\|_p, \tag{2}$$

where we set $p = 0$ or $1$ to denote the $\ell_0$ norm or its convex 
surrogate, the $\ell_1$ norm, respectively, and $\lambda$ is the sparsity 
penalty. Some of the algorithms used to solve the sparse 
coding problem in (2) include Basis Pursuit [8], Feature-Sign 
Search [9], Least Angle Regression [10], Matching Pursuit, 
and Orthogonal Matching Pursuit [11]. The dictionary $\mathbf{\Psi}$ can 
be predefined or obtained as an overcomplete set of vectors 
optimized to the training data [12]. The joint optimization 
problem of dictionary learning and sparse coding can be 
expressed as 



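$$\min_{\mathbf{\Psi}, \mathbf{A}} \|\mathbf{Y} - \mathbf{\Psi}\mathbf{A}\|_F^2 + \lambda \sum_{i=1}^{N} \|\mathbf{a}_i\|_p \quad \text{s.t. } \forall k,\ \|\boldsymbol{\psi}_k\|_2 \le 1, \tag{3}$$

where the training data matrix is $\mathbf{Y} = [\mathbf{y}_1, \ldots, \mathbf{y}_N]$, and the 
coefficient matrix is $\mathbf{A} = [\mathbf{a}_1, \ldots, \mathbf{a}_N]$. Note that (3) is not 
jointly convex in $\mathbf{\Psi}$ and $\mathbf{A}$, and hence the sparse codes and the dictionary elements 
are computed using an alternating optimization procedure [2]. 
Furthermore, the problem of learning the dictionary has been 
shown to be a generalization of 1-D subspace clustering [13]. 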
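
As a concrete illustration of the $\ell_1$ case ($p = 1$) of (2), the following minimal sketch uses the iterative soft-thresholding algorithm (ISTA). ISTA is not among the solvers listed above and is chosen here only for brevity. The function names, the random dictionary, and the parameter values are assumptions made for the example.

```python
import numpy as np

def soft_threshold(v, t):
    """Element-wise soft-thresholding, the proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista_sparse_code(y, Psi, lam, n_iter=500):
    """Approximately minimize ||y - Psi a||_2^2 + lam * ||a||_1 over a."""
    # The gradient of the quadratic term is Lipschitz with constant
    # L = 2 * sigma_max(Psi)^2, so 1/L is a safe step size.
    L = 2.0 * np.linalg.norm(Psi, 2) ** 2
    a = np.zeros(Psi.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * Psi.T @ (Psi @ a - y)
        a = soft_threshold(a - grad / L, lam / L)
    return a

# Hypothetical toy problem: a random dictionary with unit-norm atoms.
rng = np.random.default_rng(0)
M, K = 20, 50
Psi = rng.standard_normal((M, K))
Psi /= np.linalg.norm(Psi, axis=0)
a_true = np.zeros(K)
a_true[[3, 17, 41]] = [1.5, -2.0, 1.0]
y = Psi @ a_true                      # noiseless sample (n = 0)
a_hat = ista_sparse_code(y, Psi, lam=0.05)
```

With the step size $1/L$ the objective in (2) does not increase from one iteration to the next, which makes the sketch easy to sanity-check on small problems.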



B. Sparse Coding in Object Recognition and Clustering 

Sparse models have been effective in several image recovery 
applications, and this has motivated their use in computer 
vision problems. One of the first sparse coding based object 
recognition frameworks used codes obtained from raw image 
patches [14]. However, since then several frameworks have 
been proposed that use sparse codes of local descriptors, such 
as the Scale Invariant Feature Transform (SIFT), extracted 
from images. In order to construct image level descriptors, the 
codes are aggregated in an orderless bag-of-features approach 
[15] or using the spatial pyramid matching (SPM) approach 
[16] that partially preserves the spatial ordering. Methods that 
use sparse codes of local descriptors in spatial pyramids have 
achieved better performance [7], [17]-[19], in comparison 
to the original SPM approach which is based on vector 
quantization. Furthermore, the authors in [17] demonstrated 
that spatial pyramid aggregation of sparse codes can lead to 
high object recognition accuracies with just linear classifiers. 



IEEE TRANSACTIONS ON IMAGE PROCESSING 






By incorporating discriminatory constraints, sparse methods 
can be made more effective in recognition tasks. For this 
reason, some algorithms explicitly incorporate class-specific 
discriminatory information when learning the dictionary, and 
this is applied for digit recognition and image classification 
[18], [20]-[22]. Furthermore, improved discrimination among 
classes has been obtained by performing simultaneous sparse 
coding to enforce similar non-zero coefficient support for all 
samples in a class [23]-[25], and coding the descriptors using 
dictionary atoms in their neighborhood [7]. Other approaches 
that lead to discriminative sparse codes are those that require 
the codes to obey constraints induced by the neighborhood 
graphs of the training data [26], [27]. 

In addition to its widespread applicability in supervised 
learning frameworks, sparsity has also been shown to be useful 
in unsupervised clustering applications. In [28], the authors 
show that clustering graph-regularized sparse codes with K-means 
results in better clustering performance when compared 
to using the data directly. The relation between data samples 
can be inferred by representing each data sample as a sparse 
linear combination of all the others. These sparse codes can 
then be used to build an $\ell_1$ graph, and spectral clustering can 
be performed on the graph [29]. However, if there are a large 
number of training examples, obtaining sparse codes in this 
fashion can be computationally expensive. In [30], the authors 
show that this can be avoided by using a fixed size dictionary 
with appropriate constraints. This dictionary can then be used 
to obtain sparse codes, which can subsequently be used to 
perform spectral clustering. 

C. Kernel Sparse Models 

Despite its great applicability, the use of sparse models in 
object recognition presents two main challenges. Firstly, no 
single descriptor can efficiently represent the various aspects 
of the images. Hence, there is a need to integrate multiple 
descriptors extracted from images into the sparse coding 
paradigm. The other challenge is that each descriptor could 
potentially belong to a different space and the similarity 
measure for each descriptor could be defined as a non-linear 
function. The proper way to combine them hence would be to 
fuse the information that each descriptor provides about the 
images, and not the raw descriptors themselves. This can be 
efficiently performed by using appropriate kernel functions, 
which measure the similarity (possibly non-linear) between 
each set of descriptors [31], and combining the kernel similarities 
to learn sparse models in the unified feature space. 
The advantage of using multiple diverse features in object 
recognition has been demonstrated in a number of research 
efforts [32]-[34]. 

The sparse models learned in the unified feature space will 
lead to discriminatory codes that can be used with linear clas- 
sifiers. This is because, in this space, the similarity measure 
between the features is linear and hence those that belong to 
the same class will be grouped together in subspaces. Greedy 
approaches to obtain sparse codes using the kernel trick 
have been proposed [35], [36]. In [37], Guo et al. proposed 
to perform kernel sparse coding of image descriptors, and 
design dictionaries for object recognition, when the Radial 
Basis Function (RBF) kernel is used. Furthermore, the authors 
of [38] designed a kernel dictionary learning algorithm for 
digit recognition and demonstrated improved discrimination, 
particularly in the presence of noise. 

D. Contributions 

In this paper, we propose two different approaches for 
obtaining sparse codes and optimizing dictionaries in the 
unified feature space obtained using multiple kernels. Both 
the proposed methods require the extraction of appropriate 
features, and computation of their corresponding base kernels. 



The first approach, described in Section IV, works by building 
separate dictionaries to sparsely code each descriptor set, and 
fusing the corresponding kernel matrices to obtain multiple 
kernel sparse representations (MKSRs). The MKSR is a single 
sparse code that represents each training example in the unified 
feature space. Though the fused kernel matrices are used for 
evaluating MKSRs, each dictionary is optimized separately 
using a fixed point algorithm. In the second approach for 
obtaining MKSRs, we directly perform kernel dictionary learning 
using the ensemble kernel matrix of the training images, 
constructed by fusing the kernel similarities of all descriptors 
(Section V). In order to obtain kernel dictionaries, we propose 
the kernel multilevel dictionary learning (KMLD) algorithm 
and design a greedy procedure to obtain sparse codes using the 
learned dictionary. In order to evaluate the object recognition 
performance, we extract 8 different image descriptors, and 
compute MKSRs using the two proposed approaches (Section 
VI-A). We report the classification results on the Caltech-101 
[39] and Caltech-256 [40] datasets, and present comparisons 
with other recent sparse coding-based recognition frameworks. 
Simulation results show that the proposed algorithms outperform 
other baseline sparse coding based object recognition 
approaches (Section VI-B). Finally, we demonstrate the utility 
of MKSRs in unsupervised learning by performing spectral 
clustering on the coefficient graphs. We show that using MKSRs 
leads to better clustering performance, when compared 
to using kernel sparse codes obtained from individual features. 

II. Kernel Sparse Representations and Dictionary 
Learning 

The kernel function maps the non-linearly separable features 
into a high dimensional feature space, in which similar features 
are grouped together, hence improving the linear separability 
[31]. The authors of [37] showed that sparse representations 
can be efficiently computed in a high dimensional feature 
space using kernel methods. In this section, we review the 
problem of kernel sparse representations, and describe the 
procedure to optimize dictionaries using a fixed point iteration 
method when the RBF kernel is used. 

Let us define a function $\Phi : \mathbb{R}^M \mapsto \mathcal{F}$ that maps the data 
samples from the input space to a high dimensional feature 
space. The data sample in the input space $\mathbf{y}$ transforms to 
$\Phi(\mathbf{y})$ in the feature space and the $N$ training examples given 
by $\mathbf{Y} = [\mathbf{y}_1 \ldots \mathbf{y}_N]$ transform to $\Phi(\mathbf{Y}) = [\Phi(\mathbf{y}_1) \ldots \Phi(\mathbf{y}_N)]$. 
The feature space similarity or the kernel similarity between 
the training examples $\mathbf{y}_i$ and $\mathbf{y}_j$ is defined using the pre-defined 
kernel function as $\mathcal{K}(\mathbf{y}_i, \mathbf{y}_j) = \Phi(\mathbf{y}_i)^T \Phi(\mathbf{y}_j)$. The 
dictionary in the feature space is denoted by the matrix 
$\Phi(\mathbf{\Psi}) = [\Phi(\boldsymbol{\psi}_1), \Phi(\boldsymbol{\psi}_2), \ldots, \Phi(\boldsymbol{\psi}_K)]$, where each column 
indicates a dictionary element. The similarities between dictionary 
elements and the training examples can also be computed 
using the kernel function as $\Phi(\boldsymbol{\psi}_k)^T \Phi(\mathbf{y}_j) = \mathcal{K}(\boldsymbol{\psi}_k, \mathbf{y}_j)$ and 
$\Phi(\boldsymbol{\psi}_k)^T \Phi(\boldsymbol{\psi}_l) = \mathcal{K}(\boldsymbol{\psi}_k, \boldsymbol{\psi}_l)$. Since all similarities can be 
computed exclusively using the kernel function, it is not necessary 
to know the transformation $\Phi$. This greatly simplifies the 
computations in the feature space when the similarities are pre-computed 
(kernel trick). We use the notation $\mathbf{K}_{YY} \in \mathbb{R}^{N \times N}$ 
to represent the matrix $\Phi(\mathbf{Y})^T \Phi(\mathbf{Y})$, which contains the kernel 
similarities between all training examples. The similarity between 
two training examples, $\mathcal{K}(\mathbf{y}_i, \mathbf{y}_j)$, is the $(i, j)$th element 
of $\mathbf{K}_{YY}$, also denoted as $K_{y_i y_j}$. 

Sparse coding can be performed in the feature space as 



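$$\min_{\mathbf{a}} \|\Phi(\mathbf{y}) - \Phi(\mathbf{\Psi})\mathbf{a}\|_2^2 + \lambda \|\mathbf{a}\|_1, \tag{4}$$

and the objective can be expanded as 

$$\begin{aligned}
&\Phi(\mathbf{y})^T \Phi(\mathbf{y}) - 2\mathbf{a}^T \Phi(\mathbf{\Psi})^T \Phi(\mathbf{y}) + \mathbf{a}^T \Phi(\mathbf{\Psi})^T \Phi(\mathbf{\Psi})\mathbf{a} + \lambda \|\mathbf{a}\|_1 \\
&\quad = K_{yy} - 2\mathbf{a}^T \mathbf{K}_{\Psi y} + \mathbf{a}^T \mathbf{K}_{\Psi\Psi}\mathbf{a} + \lambda \|\mathbf{a}\|_1.
\end{aligned} \tag{5}$$

Note that we have used the kernel trick here to simplify the 
computations. Following our notation, $K_{yy}$ is the element 
$\mathcal{K}(\mathbf{y}, \mathbf{y})$, $\mathbf{K}_{\Psi y}$ is a $K \times 1$ vector containing the elements 
$\mathcal{K}(\boldsymbol{\psi}_k, \mathbf{y})$, for $k = \{1, \ldots, K\}$, and $\mathbf{K}_{\Psi\Psi}$ is a $K \times K$ matrix 
containing the kernel similarities between all the dictionary 
atoms. Clearly, the objective function in (5) is similar to the 
sparse coding problem, except for the use of kernel similarities. 
However, the computation of kernel matrices incurs 
additional complexity. When the kernel sparse codes for all $N$ 
training samples are computed, the dictionary can be obtained 
by minimizing 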



$$\sum_{i=1}^{N} \|\Phi(\mathbf{y}_i) - \Phi(\mathbf{\Psi})\mathbf{a}_i\|_2^2 \tag{6}$$

subject to the constraint that the atoms are normalized in the 
feature space, $\Phi(\boldsymbol{\psi}_k)^T \Phi(\boldsymbol{\psi}_k) = 1$, $\forall k = \{1, \ldots, K\}$. In 
order to perform dictionary update in the feature space, the 
authors in [37] proposed a fixed point algorithm. We derive 
expressions for the dictionary update using the fixed point iteration 
method for the case of the RBF kernel below. 
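
The kernelized objective in (5) depends on the data only through $K_{yy}$, $\mathbf{K}_{\Psi y}$ and $\mathbf{K}_{\Psi\Psi}$, so a solver never needs the map $\Phi$. The sketch below illustrates this with a proximal gradient (ISTA) iteration. The choice of solver, the helper names, and the toy data are assumptions for illustration and are not taken from [37].

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """RBF similarities between the columns of A (M x p) and B (M x q)."""
    sq_dist = ((A ** 2).sum(axis=0)[:, None] + (B ** 2).sum(axis=0)[None, :]
               - 2.0 * A.T @ B)
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))

def kernel_sparse_code(K_psi_y, K_psi_psi, lam, n_iter=500):
    """Approximately minimize -2 a^T K_psi_y + a^T K_psi_psi a + lam * ||a||_1.

    The constant K_yy in (5) does not affect the minimizer and is omitted.
    """
    # Step size 1/L with L = 2 * lambda_max(K_psi_psi).
    step = 1.0 / (2.0 * np.linalg.eigvalsh(K_psi_psi)[-1])
    a = np.zeros(K_psi_y.shape[0])
    for _ in range(n_iter):
        grad = 2.0 * (K_psi_psi @ a - K_psi_y)
        v = a - step * grad
        a = np.sign(v) * np.maximum(np.abs(v) - step * lam, 0.0)
    return a

def kernel_objective(a, K_yy, K_psi_y, K_psi_psi, lam):
    """Value of the expanded objective (5)."""
    return K_yy - 2.0 * a @ K_psi_y + a @ K_psi_psi @ a + lam * np.abs(a).sum()

# Hypothetical toy data: the sample is a noisy copy of one dictionary atom.
rng = np.random.default_rng(0)
Psi = rng.standard_normal((5, 8))               # one atom per column
y = Psi[:, [2]] + 0.05 * rng.standard_normal((5, 1))
gamma, lam = 0.5, 0.1
K_psi_psi = rbf_kernel(Psi, Psi, gamma)
K_psi_y = rbf_kernel(Psi, y, gamma)[:, 0]
a_hat = kernel_sparse_code(K_psi_y, K_psi_psi, lam)
```

For the RBF kernel $K_{yy} = 1$, so `kernel_objective(a_hat, 1.0, K_psi_y, K_psi_psi, lam)` can be compared directly against the value 1 obtained with the all-zero code.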

A. Dictionary Update for RBF Kernel 

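The radial basis function (RBF) is a well-known kernel that 
has wide applicability and it is defined as 

$$\mathcal{K}(\mathbf{y}_i, \mathbf{y}_j) = \exp(-\gamma \|\mathbf{y}_i - \mathbf{y}_j\|_2^2), \tag{7}$$

where $\gamma$ is a positive constant. Note that $\mathcal{K}(\boldsymbol{\psi}_k, \boldsymbol{\psi}_k) = 1$ 
and hence we need not enforce the normalization constraint 
$\Phi(\boldsymbol{\psi}_k)^T \Phi(\boldsymbol{\psi}_k) = 1$ for the dictionary atoms. The objective 
function for dictionary update in (6) can be expressed as 

$$\sum_{i=1}^{N} \left[ 1 - 2\sum_{l=1}^{K} a_{l,i}\, \mathcal{K}(\boldsymbol{\psi}_l, \mathbf{y}_i) + \sum_{l=1}^{K}\sum_{t=1}^{K} a_{l,i}\, a_{t,i}\, \mathcal{K}(\boldsymbol{\psi}_l, \boldsymbol{\psi}_t) \right],$$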



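where $a_{l,i}$ represents the $l$th element in the coefficient vector 
$\mathbf{a}_i$. In order to update the dictionary atom $\boldsymbol{\psi}_k$, we compute 
the derivative of the simplified objective using the definition 
of the RBF kernel in (7), and set the derivative to zero as follows, 

$$-4\gamma \sum_{i=1}^{N} \left[ -a_{k,i}\, \mathcal{K}(\boldsymbol{\psi}_k, \mathbf{y}_i)(\boldsymbol{\psi}_k - \mathbf{y}_i) + \sum_{t=1}^{K} a_{k,i}\, a_{t,i}\, \mathcal{K}(\boldsymbol{\psi}_k, \boldsymbol{\psi}_t)(\boldsymbol{\psi}_k - \boldsymbol{\psi}_t) \right] = 0. \tag{8}$$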



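To update the dictionary atoms we can employ a fixed point 
procedure, where the dictionary atom $\boldsymbol{\psi}_k$ from the $(n-1)$th iteration 
is used to compute the kernel similarities for the $n$th iteration 
of the update. Denoting the $k$th atom in the $n$th iteration by 
$\boldsymbol{\psi}_k^{(n)}$, we can rewrite the expression in (8) as 

$$-4\gamma \sum_{i=1}^{N} \left[ -a_{k,i}\, \mathcal{K}(\boldsymbol{\psi}_k^{(n-1)}, \mathbf{y}_i)(\boldsymbol{\psi}_k^{(n)} - \mathbf{y}_i) + \sum_{t=1}^{K} a_{k,i}\, a_{t,i}\, \mathcal{K}(\boldsymbol{\psi}_k^{(n-1)}, \boldsymbol{\psi}_t)(\boldsymbol{\psi}_k^{(n)} - \boldsymbol{\psi}_t) \right] = 0. \tag{9}$$

Solving (9), we obtain 

$$\boldsymbol{\psi}_k^{(n)} = \frac{\mathbf{Y}\,\mathrm{diag}[\mathbf{K}_{\psi_k Y}^{(n-1)}]\,\mathbf{a}_{k,\mathrm{row}}^T - \mathbf{\Psi}\,\mathrm{diag}[\mathbf{K}_{\psi_k \Psi}^{(n-1)}]\,\mathbf{A}\,\mathbf{a}_{k,\mathrm{row}}^T}{\mathbf{K}_{\psi_k Y}^{(n-1)}\,\mathbf{a}_{k,\mathrm{row}}^T - \mathbf{K}_{\psi_k \Psi}^{(n-1)}\,\mathbf{A}\,\mathbf{a}_{k,\mathrm{row}}^T}.$$

Here $\mathbf{a}_{k,\mathrm{row}}$ is a row vector containing the set of coefficients of 
all training vectors corresponding to the dictionary atom $\boldsymbol{\psi}_k$, 
and diag[.] creates a diagonal matrix using the argument vector 
as its diagonal. The $1 \times K$ kernel matrix $\mathbf{K}_{\psi_k \Psi}^{(n-1)}$ contains the 
elements $\mathcal{K}(\boldsymbol{\psi}_k^{(n-1)}, \boldsymbol{\psi}_l)$ for $l = \{1 \ldots K\}$, and the $1 \times N$ 
matrix $\mathbf{K}_{\psi_k Y}^{(n-1)}$ contains the elements $\mathcal{K}(\boldsymbol{\psi}_k^{(n-1)}, \mathbf{y}_i)$, for $i = 
\{1 \ldots N\}$, respectively. 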
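
A single fixed point update of one atom can be sketched as follows. The code solves (9) for the new atom with all kernel similarities evaluated at the previous iterate, and assumes that `Psi` holds the atoms of iteration $n-1$. The function names and the random test data are hypothetical, and no safeguard is included for a vanishing denominator.

```python
import numpy as np

def rbf_to_columns(v, X, gamma):
    """RBF similarities between a vector v (M,) and each column of X (M x n)."""
    return np.exp(-gamma * ((X - v[:, None]) ** 2).sum(axis=0))

def update_atom_rbf(Y, Psi, A, k, gamma):
    """One fixed point update of atom k; Psi holds the atoms of iteration n-1."""
    a_k = A[k, :]                                    # coefficients tied to atom k
    K_kY = rbf_to_columns(Psi[:, k], Y, gamma)       # similarities to the N samples
    K_kPsi = rbf_to_columns(Psi[:, k], Psi, gamma)   # similarities to the K atoms
    w = A @ a_k                                      # entry t is sum_i a_{t,i} a_{k,i}
    numerator = Y @ (K_kY * a_k) - Psi @ (K_kPsi * w)
    denominator = K_kY @ a_k - K_kPsi @ w
    return numerator / denominator

# Hypothetical data: random samples, atoms and (roughly half-zero) coefficients.
rng = np.random.default_rng(0)
M, N, K, gamma = 4, 30, 6, 0.1
Y = rng.standard_normal((M, N))
Psi = rng.standard_normal((M, K))
A = rng.standard_normal((K, N)) * (rng.random((K, N)) < 0.5)
psi_new = update_atom_rbf(Y, Psi, A, k=0, gamma=gamma)
```

In a full dictionary update this step would be repeated over all atoms and iterated until the atoms stop changing; that outer loop is omitted here.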

III. Multiple Kernel Sparse Representations 

The use of multiple descriptors to characterize images has 
been a very successful approach for complex visual recog- 
nition tasks. Though this method provides the flexibility of 
choosing features to describe different aspects of the under- 
lying data, the resulting representations are high-dimensional, 
and the descriptors can be very diverse. Hence, in order to 
facilitate recognition tasks, there is a need to transform these 
features to a unified space, and construct low dimensional 
compact representations for the images in the unified space. 
Multiple Kernel Learning (MKL) provides a convenient way 
of fusing multiple descriptors by combining multiple base 
kernels, each of which is created based on an image descriptor 
[32]. In this work, we develop the Multiple Kernel Sparse 
Representations (MKSR) model that aims to compute the 
sparse representation for an image in the unified feature space 
obtained using multiple kernels. In addition to building a 
compact representation for the data in the high-dimensional 
feature space, MKSR can provide highly discriminative codes. 

Let us denote the set of descriptors obtained from the 
data sample as $\mathbf{y}_i = \{\mathbf{y}_{i,r}\}_{r=1}^{R}$, where $i$ is the sample 
index, $r$ is the descriptor index, and $R$ is the total number 
of descriptors. Let the set of $R$ base kernel functions, each 
corresponding to a descriptor, be denoted as $\{\mathcal{K}_r\}_{r=1}^{R}$, and the 
base kernel matrices be denoted as $\{\mathbf{K}_r\}_{r=1}^{R}$. The ensemble 
kernel function and matrix can be constructed as the non-negative 
linear combinations 

$$\mathcal{K}(\mathbf{y}_i, \mathbf{y}_j) = \sum_{r=1}^{R} \beta_r\, \mathcal{K}_r(\mathbf{y}_{i,r}, \mathbf{y}_{j,r}), \quad \forall \beta_r \ge 0, \tag{10}$$

$$\mathbf{K} = \sum_{r=1}^{R} \beta_r\, \mathbf{K}_r, \quad \forall \beta_r \ge 0. \tag{11}$$



As described earlier, the ensemble kernel matrix contains the 
similarities between the data samples in the unified feature 
space obtained using multiple kernels. Various types of de- 
scriptors such as raw pixel values, histograms, feature vectors, 
and spatial pyramids can be successfully combined by consid- 
ering only the kernel matrices corresponding to the descriptors. 
Given a distance function $\rho_r$ measuring the distance between 
the $r$th descriptors of two samples, we construct the kernel 
matrix as 

$$\mathbf{K}_r(i,j) = \mathcal{K}_r(\mathbf{y}_{i,r}, \mathbf{y}_{j,r}) = \exp(-\gamma\, \rho_r(\mathbf{y}_{i,r}, \mathbf{y}_{j,r})), \tag{12}$$

where $\gamma$ is a positive constant. Computing this kernel function 
is convenient, since several useful image descriptors and their 
corresponding distance functions have been proposed in the 
literature. Note that the kernel matrix in (12) is not guaranteed 
to be positive semidefinite, which is a required characteristic 
for a proper kernel matrix. In such cases, we follow the 
approach in pT) where we compute the smallest eigenvalue 
of K r and if it is negative, we add its absolute value to the 
diagonal of K r . 
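
The construction of the base kernels from distances, the eigenvalue shift, and the non-negative combination into an ensemble kernel can be sketched as below. The two descriptors, their distance functions (Euclidean and chi-square), and the weights are hypothetical choices made only for the example.

```python
import numpy as np

def distance_kernel(D, gamma):
    """Kernel matrix exp(-gamma * rho_r) from pairwise distances D, as in (12)."""
    return np.exp(-gamma * D)

def shift_to_psd(K):
    """If the smallest eigenvalue is negative, add its absolute value to the diagonal."""
    lam_min = np.linalg.eigvalsh(K)[0]          # eigenvalues are returned in ascending order
    if lam_min < 0:
        K = K + abs(lam_min) * np.eye(K.shape[0])
    return K

def ensemble_kernel(kernels, betas):
    """Non-negative linear combination of base kernel matrices, as in (11)."""
    betas = np.asarray(betas, dtype=float)
    if np.any(betas < 0):
        raise ValueError("kernel weights must be non-negative")
    return sum(b * K for b, K in zip(betas, kernels))

# Hypothetical descriptors for N samples: a feature vector and a histogram.
rng = np.random.default_rng(0)
N = 12
X1 = rng.random((N, 5))
X2 = rng.random((N, 7))
X2 /= X2.sum(axis=1, keepdims=True)
D1 = np.linalg.norm(X1[:, None, :] - X1[None, :, :], axis=2)    # Euclidean distance
diff = X2[:, None, :] - X2[None, :, :]
total = X2[:, None, :] + X2[None, :, :]
D2 = 0.5 * (diff ** 2 / total).sum(axis=2)                      # chi-square distance
base = [shift_to_psd(distance_kernel(D, gamma=1.0)) for D in (D1, D2)]
K_ens = ensemble_kernel(base, betas=[0.6, 0.4])
```

Because each shifted base kernel is positive semidefinite and the weights are non-negative, the ensemble matrix is positive semidefinite as well.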

Given a dictionary $\mathbf{\Psi}$ and a kernel function $\mathcal{K}$, sparse 
codes can be obtained as described in Section II. However, 
in the MKSR model we have $R$ different descriptors for each 
image. In order to obtain sparse codes using an ensemble 
kernel matrix, we need to optimize the dictionaries for the 
unified feature space. In this paper, we consider two different 
approaches for obtaining multiple kernel sparse codes and 
designing dictionaries using the kernel trick, and they will 
be described in Sections IV and V respectively. The first 
approach computes separate dictionaries for each descriptor, 
and obtains sparse codes in the unified feature space. The 
second approach directly builds a multilevel dictionary [42] 
using the ensemble kernel matrix in the unified feature space. 
As a result, this approach does not require knowledge of the 
mathematical form of the individual kernels. On the other 
hand, the advantage with the first approach is that optimized 
dictionaries for each descriptor, if previously available, can be 
readily used to initialize the learning algorithm. 

IV. Proposed Method 1 

In this section, we present Method 1, which alternately 
optimizes separate dictionaries for each image descriptor and 
obtains MKSRs by fusing the individual kernels. Consider 
a dataset of $N$ data samples and $R$ different descriptors 
that characterize them. The kernel function $\mathcal{K}_r$ for descriptor 
$r$ can be a pre-defined kernel or constructed using the distance 
function $d_r$. For the $r$th descriptor, we use its corresponding 
samples, $\{\mathbf{y}_{i,r}\}_{i=1}^{N}$, to learn the dictionary $\mathbf{\Psi}_r = 
[\boldsymbol{\psi}_{1,r}, \boldsymbol{\psi}_{2,r}, \ldots, \boldsymbol{\psi}_{K,r}]$ that can sparsely represent the descriptor 
set in its feature space. 

[Figure 1: block diagram with "Kernel Sparse Coding" and "Dictionary Update" stages.] 

Fig. 1. Proposed Method 1 for obtaining multiple kernel sparse representations. 
In this approach, we alternately optimize the individual dictionaries 
$\{\mathbf{\Psi}_r\}_{r=1}^{R}$ and obtain the unified sparse codes $\{\mathbf{a}_i\}_{i=1}^{N}$. 

We compute the kernel matrices as 
$\mathbf{K}_{YY,r}(i,j) = \mathcal{K}_r(\mathbf{y}_{i,r}, \mathbf{y}_{j,r})$, $\mathbf{K}_{\Psi Y,r}(k,i) = \mathcal{K}_r(\boldsymbol{\psi}_{k,r}, \mathbf{y}_{i,r})$, 
and $\mathbf{K}_{\Psi\Psi,r}(k,l) = \mathcal{K}_r(\boldsymbol{\psi}_{k,r}, \boldsymbol{\psi}_{l,r})$ respectively. Using the $R$ 
descriptors, we obtain the ensemble kernel matrices as, 

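$$\mathbf{K}_{YY}(i,j) = \sum_{r=1}^{R} \beta_r\, \mathbf{K}_{YY,r}(i,j), \quad \forall \beta_r \ge 0, \tag{13}$$

$$\mathbf{K}_{\Psi Y}(k,i) = \sum_{r=1}^{R} \beta_r\, \mathbf{K}_{\Psi Y,r}(k,i), \quad \forall \beta_r \ge 0, \tag{14}$$

$$\mathbf{K}_{\Psi\Psi}(k,l) = \sum_{r=1}^{R} \beta_r\, \mathbf{K}_{\Psi\Psi,r}(k,l), \quad \forall \beta_r \ge 0. \tag{15}$$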

The objective function to be minimized for MKSR can be 
expressed as 

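$$\min_{\mathbf{a}} \|\Phi(\mathbf{y}) - \Phi(\mathbf{\Psi})\mathbf{a}\|_2^2 + \lambda \|\mathbf{a}\|_1 = \min_{\mathbf{a}} K_{yy} - 2\mathbf{a}^T \mathbf{K}_{\Psi y} + \mathbf{a}^T \mathbf{K}_{\Psi\Psi}\mathbf{a} + \lambda \|\mathbf{a}\|_1. \tag{16}$$

Here $\Phi(\mathbf{y})$ and $\Phi(\mathbf{\Psi})$ denote the data sample and the dictionary 
in the multiple kernel domain. The dictionaries and the 
MKSRs are hence computed in an alternating fashion until the 
objective converges. 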

A. Updating the Dictionaries 

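As illustrated in Figure 1, the proposed Method 1 fuses 
multiple dictionaries to perform sparse coding in the unified 
feature space. Optimization of the dictionaries for efficient 
coding needs to be carried out using a fixed point algorithm 
as described in Section II. The objective function for 
dictionary update, $\sum_{i=1}^{N} \|\Phi(\mathbf{y}_i) - \Phi(\mathbf{\Psi})\mathbf{a}_i\|_2^2$, can be rewritten as 

$$\begin{aligned}
&\sum_{i=1}^{N} \left[ \mathcal{K}(\mathbf{y}_i, \mathbf{y}_i) - 2\sum_{l=1}^{K} a_{l,i}\, \mathcal{K}(\boldsymbol{\psi}_l, \mathbf{y}_i) + \sum_{l=1}^{K}\sum_{t=1}^{K} a_{l,i}\, a_{t,i}\, \mathcal{K}(\boldsymbol{\psi}_l, \boldsymbol{\psi}_t) \right] \\
&= \sum_{i=1}^{N} \left[ \sum_{r=1}^{R} \beta_r \mathcal{K}_r(\mathbf{y}_{i,r}, \mathbf{y}_{i,r}) - 2\sum_{l=1}^{K} a_{l,i} \sum_{r=1}^{R} \beta_r \mathcal{K}_r(\boldsymbol{\psi}_{l,r}, \mathbf{y}_{i,r}) + \sum_{l=1}^{K}\sum_{t=1}^{K} a_{l,i}\, a_{t,i} \sum_{r=1}^{R} \beta_r \mathcal{K}_r(\boldsymbol{\psi}_{l,r}, \boldsymbol{\psi}_{t,r}) \right].
\end{aligned} \tag{17}$$

Fig. 2. Proposed Method 2 for obtaining multiple kernel sparse representations. 
In this approach, we perform dictionary learning using the ensemble 
kernel in the unified feature space directly. 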



In (17), we have expanded the ensemble kernel function in 
terms of its individual base kernels. Since there are $R$ different 
dictionaries with $K$ atoms each, the fixed point algorithm 
needs to update the atoms of one dictionary at a time, fixing 
the other $R - 1$ dictionaries. Hence, to update the $k$th atom 
of $\mathbf{\Psi}_r$, we compute the derivative of (17) with respect to 
$\boldsymbol{\psi}_{k,r}$. Excluding all terms in (17) that do not depend on the 
dictionary $\mathbf{\Psi}_r$ currently being updated, the objective can be 
simplified as 

$$\sum_{i=1}^{N} \left[ -2\sum_{l=1}^{K} a_{l,i}\, \beta_r\, \mathcal{K}_r(\boldsymbol{\psi}_{l,r}, \mathbf{y}_{i,r}) + \sum_{l=1}^{K}\sum_{t=1}^{K} a_{l,i}\, a_{t,i}\, \beta_r\, \mathcal{K}_r(\boldsymbol{\psi}_{l,r}, \boldsymbol{\psi}_{t,r}) \right]. \tag{18}$$

The objective in (18) is equivalent to updating the dictionary 
atoms for kernel sparse coding, with the kernel function $\mathcal{K}_r$. 
Note that the same set of coefficients $\{\mathbf{a}_i\}_{i=1}^{N}$ is used for 
updating all the $R$ dictionaries, since the representation is 
computed in the unified feature space. In order to obtain dictionary 
update expressions as shown in Section II, the kernel 
function should be differentiable. Furthermore, the ensemble 
kernel weights $\{\beta_r\}_{r=1}^{R}$ balance the relative importance of 
each feature and hence they are chosen empirically such that 
the best classification performance is obtained. 



B. Computing Sparse Codes for Test Data 

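Given the test data sample $\mathbf{x}$, we extract the $R$ different 
descriptors $\{\mathbf{x}_r\}_{r=1}^{R}$ and evaluate the ensemble kernel matrix 
$\mathbf{K}_{\Psi x}$ of size $K \times 1$, where 

$$\mathbf{K}_{\Psi x}(k) = \sum_{r=1}^{R} \beta_r\, \mathcal{K}_r(\boldsymbol{\psi}_{k,r}, \mathbf{x}_r). \tag{19}$$

The kernel weights $\{\beta_r\}_{r=1}^{R}$, evaluated empirically during the 
training stage, are used with the test data. The sparse code $\mathbf{b}$ 
is then obtained by minimizing the objective 

$$K_{xx} - 2\mathbf{b}^T \mathbf{K}_{\Psi x} + \mathbf{b}^T \mathbf{K}_{\Psi\Psi}\mathbf{b} + \lambda \|\mathbf{b}\|_1. \tag{20}$$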



V. Proposed Method 2 



An alternative approach to optimizing the dictionary for 
sparse coding in the unified feature space is to perform 
dictionary learning directly using the ensemble kernel matrix 



of descriptors obtained from the training data. We learn a 
hierarchical dictionary in multiple levels and refer to this as 
kernel multilevel dictionary (KMLD) learning. This algorithm 
is a generalization of the MLD algorithm proposed in [43] to 
the feature space. Such an approach eliminates the need to 
explicitly optimize separate dictionaries and hence provides 
the flexibility to choose any kernel function. The proposed 
framework, referred to as Method 2, is illustrated in Figure 
2. In [43], it is shown that global dictionaries, obtained 
by performing 1-D subspace clustering in multiple levels, 
generalize better to novel test samples when compared to 
other dictionary learning methods. In this section, we begin 
by briefly discussing K-lines clustering, the 1-D subspace 
clustering procedure, and the multilevel dictionary (MLD) 
learning algorithm. We then present the kernelized version 
of the K-lines clustering algorithm [44], and develop the 
kernel MLD learning algorithm for designing dictionaries 
using multiple kernels. Finally, we describe the procedure to 
compute the sparse code for a test sample using a kernel MLD. 

A. K-lines Clustering Algorithm 

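The K-lines clustering algorithm is an iterative procedure that 
performs a least squares fit of $K$ 1-D subspaces to the training 
data [45]. Given the set of $T$ data samples $\mathbf{Y} = \{\mathbf{y}_i\}_{i=1}^{T}$ and 
the number of clusters $K$, K-lines clustering proceeds in two 
stages after initialization: the cluster assignment and the cluster 
centroid update. In the cluster assignment stage, a training 
vector $\mathbf{y}_i$ is assigned to a cluster $k$ based on the following 
distortion measure, 

$$d(\mathbf{y}_i, \boldsymbol{\psi}_k) = \|\mathbf{y}_i - \boldsymbol{\psi}_k(\mathbf{y}_i^T \boldsymbol{\psi}_k)\|_2^2. \tag{21}$$

Cluster assignment results in an update of the $K$ membership 
sets, $\{\mathcal{C}_k\}_{k=1}^{K}$. In the cluster centroid update stage, given the 
set $\mathcal{C}_k$, the $k$th cluster centroid is updated as 

$$\min_{\boldsymbol{\psi}_k} \sum_{i \in \mathcal{C}_k} \|\mathbf{y}_i - \boldsymbol{\psi}_k(\mathbf{y}_i^T \boldsymbol{\psi}_k)\|_2^2 \quad \text{subj. to } \|\boldsymbol{\psi}_k\|_2 = 1. \tag{22}$$

The principal left singular vector from the singular value 
decomposition (SVD) of $\mathbf{Y}_k = [\mathbf{y}_i]_{i \in \mathcal{C}_k}$ is the centroid of 
cluster $k$. Note that we compute the principal left singular 
vector using an iterative procedure, similar to the Power 
method. We rewrite (22) as 

$$\min_{\boldsymbol{\psi}_k, \{a_{k,i}\}} \sum_{i \in \mathcal{C}_k} \|\mathbf{y}_i - a_{k,i} \boldsymbol{\psi}_k\|_2^2 \quad \text{subj. to } \|\boldsymbol{\psi}_k\|_2^2 = 1, \tag{23}$$

where $\{a_{k,i}\}$ is the set of coefficients for the data samples 
with $i \in \mathcal{C}_k$, and solve alternately for $\boldsymbol{\psi}_k$ and $\{a_{k,i}\}$. Fixing 
$\boldsymbol{\psi}_k$, we can compute $\{a_{k,i}\}$ as 

$$a_{k,i} = \mathbf{y}_i^T \boldsymbol{\psi}_k, \quad \mathbf{y}_i \in \mathcal{C}_k. \tag{24}$$

Incorporating the constraint $\|\boldsymbol{\psi}_k\|_2 = 1$ and assuming $\{a_{k,i}\}$ 
to be known, we can compute the cluster center as 

$$\boldsymbol{\psi}_k = \frac{\sum_{i \in \mathcal{C}_k} a_{k,i}\, \mathbf{y}_i}{\left\| \sum_{i \in \mathcal{C}_k} a_{k,i}\, \mathbf{y}_i \right\|_2}. \tag{25}$$

The centroid and the coefficients can be obtained for each 
cluster by repeating (24) and (25) for a sufficient number of 
iterations. 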
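
The two stages of K-lines clustering can be sketched as follows. The initial atoms `Psi0` are supplied by the caller, which is an assumption of this example (the text above only states that an initialization precedes the two stages), and the function name and toy data are hypothetical.

```python
import numpy as np

def k_lines(Y, Psi0, n_outer=10, n_inner=5):
    """K-lines clustering of the columns of Y, starting from the atoms in Psi0."""
    Psi = Psi0 / np.linalg.norm(Psi0, axis=0)
    for _ in range(n_outer):
        # Cluster assignment: for unit-norm atoms the distortion (21) equals
        # ||y||^2 - (y^T psi)^2, so the best atom maximizes |y^T psi|.
        labels = np.argmax(np.abs(Y.T @ Psi), axis=1)
        # Centroid update: alternate (24) and (25) within each cluster.
        for k in range(Psi.shape[1]):
            Yk = Y[:, labels == k]
            for _ in range(n_inner):
                a = Yk.T @ Psi[:, k]          # coefficients, as in (24)
                v = Yk @ a                    # unnormalized centroid, as in (25)
                norm = np.linalg.norm(v)
                if norm == 0.0:               # empty or degenerate cluster: keep the atom
                    break
                Psi[:, k] = v / norm
    labels = np.argmax(np.abs(Y.T @ Psi), axis=1)
    return Psi, labels

# Hypothetical toy data: noisy samples scattered along two lines in R^3.
rng = np.random.default_rng(0)
u1, u2 = np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])
c = rng.standard_normal(40)
Y = np.hstack([np.outer(u1, c[:20]), np.outer(u2, c[20:])])
Y += 0.01 * rng.standard_normal(Y.shape)
Psi, labels = k_lines(Y, Psi0=Y[:, [0, 20]])
```

The inner loop is a power iteration on $\mathbf{Y}_k \mathbf{Y}_k^T$, so a handful of inner iterations is usually enough when one singular value dominates within a cluster.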
TABLE I 

The Kernel K-lines Clustering Algorithm. 

Input 
    $\mathbf{Y} = [\mathbf{y}_i]_{i=1}^{N}$, $M \times N$ matrix of data samples. 
    $\mathbf{K}_{YY}$, $N \times N$ kernel matrix. 
    $K$, desired number of clusters. 

Initialization 
    - Randomly group data samples to initialize the membership matrix $\mathbf{Z}$. 
    - Based on $\mathbf{Z}$, obtain the rank-1 SVD for each cluster to initialize $\mathbf{\Psi}$. 
    - Compute the initial coefficients, $\mathbf{A} = \mathbf{Y}^T \mathbf{\Psi}$. 

Algorithm 
    Loop until convergence 
        Loop for $L$ iterations 
            - Compute $\mathbf{H} = \mathbf{Z} \odot \mathbf{A}$. 
            - Compute $\mathbf{A} = \mathbf{K}_{YY} \mathbf{H}\, \Gamma(\Phi(\mathbf{Y})\mathbf{H})^{-1}$. 
        end 
        - Update $\mathbf{Z}$ by identifying the index of absolute maximum in each 
          row of $\mathbf{A}$. 
    end 



To express this procedure using matrices, we define the 
membership matrix $\mathbf{Z} \in \mathbb{R}^{N \times K}$, where $z_{ik} = 1$ if and only if 
$i \in \mathcal{C}_k$. Cluster assignment is performed by computing 

$$\mathbf{A} = \mathbf{Y}^T \mathbf{\Psi}, \tag{26}$$

and then setting $\mathbf{Z} = g(\mathbf{A})$, where $g(.)$ is a function that 
operates on a matrix and returns 1 at the location of absolute 
maximum of each row and zero elsewhere. Let us also define 
the matrix $\mathbf{H} = \mathbf{Z} \odot \mathbf{A}$, where $\odot$ indicates the Hadamard 
product. The centroid update can be then performed as 

$$\mathbf{\Psi} = \mathbf{Y}\mathbf{H}\, \Gamma(\mathbf{Y}\mathbf{H})^{-1}, \tag{27}$$

which is the matrix version of the update equation given 
in (25). Here, $\Gamma(.)$ is a function that operates on a matrix 
and returns a diagonal matrix with the $\ell_2$ norm of each 
column in the argument matrix as its diagonal element. In 
(27), $\Gamma(\mathbf{Y}\mathbf{H})^{-1}$ ensures that the columns of $\mathbf{\Psi}$ are normalized. 
K-lines clustering is hence performed by iterating over 
membership update and centroid computation steps. 



B. Multilevel Dictionary Learning 

The MLD learning algorithm proposed in [43] uses a 
hierarchical approach by employing K-lines clustering to adapt 
atoms in each level of the dictionary. We denote the multilevel 
dictionary as $\mathbf{\Psi} = [\mathbf{\Psi}_1 \mathbf{\Psi}_2 \ldots \mathbf{\Psi}_S]$, and the coefficient matrix 
as $\mathbf{A} = [\mathbf{A}_1^T \mathbf{A}_2^T \ldots \mathbf{A}_S^T]^T$. Here, $\mathbf{\Psi}_s$ is the sub-dictionary in 
level $s$ and $\mathbf{A}_s$ is the corresponding coefficient matrix in level 
$s$. The approximation in level $s$ is expressed as 

$$\mathbf{R}_{s-1} = \mathbf{\Psi}_s \mathbf{A}_s + \mathbf{R}_s, \quad \text{for } s = 1, \ldots, S, \tag{28}$$

where $\mathbf{R}_{s-1}$, $\mathbf{R}_s$ are the residuals for the levels $s - 1$ and $s$ 
respectively and $\mathbf{R}_0 = \mathbf{Y}$. This implies that the residuals in 
level $s - 1$ serve as the training data for level $s$. Note that the 
sparsity of the representation in each level is fixed at 1. K-lines 
clustering is employed to learn $\mathbf{\Psi}_s$ from the training matrix, 
$\mathbf{R}_{s-1}$, for that level. Detailed discussion on the generalization 
characteristics of multilevel learning can be found in [42]. 



C. Kernel K-lines Clustering Algorithm 

In this section, we propose an algorithm to perform K- 
lines clustering in the feature space using kernel similarities. 
Transformation of data to an appropriate feature space leads to 
tighter clusters and hence developing a kernel version of the K- 
lines clustering algorithm may lead to an improved clustering 
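accuracy. Using the transformed data samples and dictionary 
elements, the coefficient matrix, $\mathbf{A}$, in the feature space can be 
computed in a manner similar to (26), 

$$\mathbf{A} = \Phi(\mathbf{Y})^T \Phi(\mathbf{\Psi}), \tag{29}$$

and the membership matrix is given by $\mathbf{Z} = g(\mathbf{A})$. Hence, the 
cluster centers in the feature space can be obtained as 

$$\Phi(\mathbf{\Psi}) = \Phi(\mathbf{Y})\mathbf{H}\, \Gamma(\Phi(\mathbf{Y})\mathbf{H})^{-1}, \tag{30}$$

where $\mathbf{H} = \mathbf{Z} \odot \mathbf{A}$. The normalization term is computed as 

$$\Gamma(\Phi(\mathbf{Y})\mathbf{H}) = \mathrm{diag}\left((\Phi(\mathbf{Y})\mathbf{H})^T \Phi(\mathbf{Y})\mathbf{H}\right)^{1/2} = \mathrm{diag}(\mathbf{H}^T \mathbf{K}_{YY} \mathbf{H})^{1/2}, \tag{31}$$

where diag(.) is an operator that returns a diagonal matrix 
with the diagonal elements same as that of the argument 
matrix. Combining (29), (30) and (31), we obtain 

$$\mathbf{A} = \Phi(\mathbf{Y})^T \Phi(\mathbf{Y})\mathbf{H}\, \Gamma(\Phi(\mathbf{Y})\mathbf{H})^{-1} = \mathbf{K}_{YY} \mathbf{H}\, \Gamma(\Phi(\mathbf{Y})\mathbf{H})^{-1}.$$

The steps of this algorithm are presented in Table I. 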
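
The kernel K-lines iteration needs only the kernel matrix, which the following sketch makes explicit. One deliberate departure from Table I is labeled in the code: instead of the SVD-based initialization in the input space, the initial coefficients are set to the membership indicators so that nothing but $\mathbf{K}_{YY}$ is required. The function names and the toy data are hypothetical.

```python
import numpy as np

def one_hot_abs_argmax(A):
    """g(A): 1 at the absolute maximum of each row, 0 elsewhere."""
    Z = np.zeros_like(A)
    Z[np.arange(A.shape[0]), np.argmax(np.abs(A), axis=1)] = 1.0
    return Z

def kernel_k_lines(K_YY, labels0, n_clusters, n_outer=10, n_inner=5):
    """Kernel K-lines clustering from an N x N kernel matrix and initial labels."""
    N = K_YY.shape[0]
    Z = np.zeros((N, n_clusters))
    Z[np.arange(N), labels0] = 1.0
    A = Z.copy()      # assumption: unit coefficients replace the SVD-based init
    for _ in range(n_outer):
        for _ in range(n_inner):
            H = Z * A                                          # Hadamard product
            sq_norms = np.einsum("ik,ij,jk->k", H, K_YY, H)    # diag(H^T K_YY H)
            inv_norms = np.zeros(n_clusters)
            ok = sq_norms > 1e-12                              # skip empty clusters
            inv_norms[ok] = 1.0 / np.sqrt(sq_norms[ok])        # Gamma(Phi(Y) H)^{-1}
            A = (K_YY @ H) * inv_norms                         # A = K_YY H Gamma^{-1}
        Z = one_hot_abs_argmax(A)
    return np.argmax(Z, axis=1), A

# Hypothetical usage with an RBF kernel on two well-separated groups of points.
rng = np.random.default_rng(0)
X = np.vstack([rng.standard_normal((15, 2)), rng.standard_normal((15, 2)) + 6.0])
sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
K_rbf = np.exp(-0.5 * sq_dist)
labels, A = kernel_k_lines(K_rbf, rng.integers(0, 2, size=30), n_clusters=2)
```

With a linear kernel, $\mathbf{K}_{YY} = \mathbf{Y}^T \mathbf{Y}$, the iteration reduces to the matrix form of K-lines clustering in (26) and (27).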



D. Kernel Multilevel Dictionary Learning Algorithm 

Given a set of training samples, our goal is to design 
multilevel dictionaries in the unified feature space obtained 
using multiple kernels. The kernel K-lines clustering procedure 
developed in the previous section can be used to learn the 
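atoms in every level of the dictionary. In level $s$, we denote the 
sub-dictionary by $\Phi(\mathbf{\Psi}_s)$, the membership matrix by $\mathbf{Z}_s$, the 
coefficient matrix by $\mathbf{A}_s$, and the input and the residual matrices 
by $\Phi(\mathbf{Y}_s)$ and $\Phi(\mathbf{R}_s)$ respectively. The training set for the first 
level is $\mathbf{Y}_1 = \mathbf{Y}$. Given the $N$ training images, we build the 
$R$ descriptors and compute the ensemble kernel matrix $\mathbf{K}_{YY}$ 
as given in (13), 

$$\mathbf{K}_{YY}(i,j) = \mathcal{K}(\mathbf{y}_i, \mathbf{y}_j) = \sum_{r=1}^{R} \beta_r\, \mathcal{K}_r(\mathbf{y}_{i,r}, \mathbf{y}_{j,r}). \tag{32}$$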



As described in the previous section, we can compute 
H s = Z s A s . Performing kernel K-lines clustering in 
level 1 will yield the coefficients Ai = K YY HiDi, where 
Di = r($(Y 1 )Hi)- 1 = dwg(HfK YY Hi)~ 1/2 indicates 
the diagonal matrix that normalizes the dictionary atoms of 
level 1 in the feature space. In kernel MLD learning, the 
residual vectors in a level are used as the training set to the 
next level. Hence, we compute the residuals as 

*(R0 = $(Y0 - $(*i)Hf = $(Y0 - $(Y 1 )H 1 D 1 Hf , 
= 4>(Yi) [I - HiDiH/f] = $(Y 2 ). (33) 
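One intermediate step may help here. It uses only the definitions above, with $P_1$ introduced as shorthand (a symbol not in the original). Because the residuals are a linear combination of the columns of $\Phi(Y_1)$, their Gram matrix can be written in terms of $K_{YY}$ alone, which is what allows the next level to be learned without explicit feature-space vectors:

```latex
\begin{aligned}
P_1 &= I - H_1 D_1 H_1^T, \\
\Phi(Y_2)^T \Phi(Y_2) &= P_1^T \, \Phi(Y_1)^T \Phi(Y_1) \, P_1 = P_1^T K_{YY} P_1 .
\end{aligned}
```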

Given the residuals from level 1, the dictionary atoms in level
2 can be computed as $\Phi(\Psi_2) = \Phi(Y_2) H_2 D_2$, where
$D_2 = \mathrm{diag}\left((\Phi(Y_2)H_2)^T (\Phi(Y_2)H_2)\right)^{-1/2}$. This is simplified as

$$D_2 = \mathrm{diag}\left[H_2^T (I - H_1 D_1 H_1^T)^T K_{YY} (I - H_1 D_1 H_1^T) H_2\right]^{-1/2}. \tag{34}$$

Similar to the previous level, the coefficients are evaluated as

$$A_2 = \Phi(Y_2)^T \Phi(\Psi_2) = (I - H_1 D_1 H_1^T)^T K_{YY} (I - H_1 D_1 H_1^T) H_2 D_2. \tag{35}$$

Table II shows the detailed algorithm to learn a kernel MLD
by generalizing the procedure for S levels. Note that the
innermost loop in the algorithm computes the cluster centroids
using the linearized SVD procedure (Section V-A). The middle
loop performs the kernel K-lines clustering for a particular
level. Similar to Method 1, the weights $\{\beta_r\}_{r=1}^{R}$ are tuned
empirically to provide the best classification performance.

TABLE II

Kernel Multilevel Dictionary Learning algorithm.

Input
  $Y_1 = [y_{1,i}]_{i=1}^{N}$, D x N matrix of training image patches.
  $K_{YY}$, N x N kernel matrix.
  K, desired number of atoms per level.
  S, total number of levels.

Initialization
  - Randomly initialize the membership $Z_1$ and compute the
    initial coefficients, $A_1$.

Algorithm
  For s = 1 to S
    Loop until convergence
      - Loop for L iterations
          - Compute $H_s = Z_s \odot A_s$.
          - Compute $D_s = \mathrm{diag}\left[H_s^T \left(\prod_{t=1}^{s-1}(I - H_t D_t H_t^T)\right)^T K_{YY} \left(\prod_{t=1}^{s-1}(I - H_t D_t H_t^T)\right) H_s\right]^{-1/2}$.
          - Evaluate $A_s = \left(\prod_{t=1}^{s-1}(I - H_t D_t H_t^T)\right)^T K_{YY} \left(\prod_{t=1}^{s-1}(I - H_t D_t H_t^T)\right) H_s D_s$.
        end
      - Update $Z_s$ using index of absolute maximum in each
        row of $A_s$.
    end
  end
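The multilevel procedure can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: all function names are hypothetical, fixed iteration counts (`n_outer`, `n_inner`) stand in for the convergence test, the initial coefficients are taken to be all ones, and atoms with numerically zero norm are given a zero scale. The sketch carries the deflated Gram matrix from level to level rather than recomputing the full product of $(I - H_t D_t H_t^T)$ factors. The `encode` function anticipates the test-sample procedure of Section V-E and uses a level-by-level recursion on the kernel similarities, which should be algebraically equivalent to evaluating the correlations in closed form.

```python
import math
import random

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def scale_cols(M, d):
    """M D for a diagonal matrix D stored as the list d."""
    return [[v * s for v, s in zip(row, d)] for row in M]

def mask(A, labels):
    """H = Z * A (element-wise): keep one coefficient per row."""
    return [[a if k == lab else 0.0 for k, a in enumerate(row)]
            for row, lab in zip(A, labels)]

def coeffs(G, H):
    """Return A = G H D and the diagonal d of D = diag(H^T G H)^(-1/2)."""
    GH = matmul(G, H)
    sq = [sum(h[k] * g[k] for h, g in zip(H, GH)) for k in range(len(H[0]))]
    # Atoms with numerically zero norm get a zero scale (an assumption).
    d = [1.0 / math.sqrt(v) if v > 1e-12 else 0.0 for v in sq]
    return scale_cols(GH, d), d

def deflator(H, d):
    """I - H D H^T, the factor that maps one level's data to its residuals."""
    HDHt = matmul(scale_cols(H, d), transpose(H))
    n = len(H)
    return [[float(i == j) - HDHt[i][j] for j in range(n)] for i in range(n)]

def argmax_abs(row):
    return max(range(len(row)), key=lambda k: abs(row[k]))

def klines_level(G, n_atoms, rng, n_outer=10, n_inner=5):
    """Kernel K-lines on the Gram matrix G of one level."""
    labels = [rng.randrange(n_atoms) for _ in G]   # random membership
    A = [[1.0] * n_atoms for _ in G]               # assumed initial coefficients
    for _ in range(n_outer):                       # fixed count replaces convergence test
        for _ in range(n_inner):                   # refine atoms with membership fixed
            A, _d = coeffs(G, mask(A, labels))
        labels = [argmax_abs(row) for row in A]    # update membership
    H = mask(A, labels)
    return H, coeffs(G, H)[1]

def train_kernel_mld(K_yy, n_atoms, n_levels, seed=0):
    """Return one (H_s, d_s) pair per level."""
    rng = random.Random(seed)
    G, levels = K_yy, []
    for _ in range(n_levels):
        H, d = klines_level(G, n_atoms, rng)
        levels.append((H, d))
        P = deflator(H, d)
        G = matmul(transpose(P), matmul(G, P))     # Gram matrix of the residuals
    return levels

def encode(k_x, K_yy, levels):
    """Stack one 1-sparse coefficient vector per level for a new sample."""
    m, G, code = [list(k_x)], K_yy, []
    for H, d in levels:
        HD = scale_cols(H, d)
        alpha = matmul(m, HD)[0]                   # correlations with this level's atoms
        best = argmax_abs(alpha)
        h = [a if k == best else 0.0 for k, a in enumerate(alpha)]
        code.extend(h)
        P = deflator(H, d)
        # Similarities between the new residual and the next level's training set.
        back = matmul([h], matmul(transpose(HD), G))[0]
        m = matmul([[a - b for a, b in zip(m[0], back)]], P)
        G = matmul(transpose(P), matmul(G, P))
    return code
```

Here `k_x` is the row of ensemble kernel similarities between the new sample and the N training samples, and the returned list has `n_atoms * n_levels` entries with at most one nonzero per level.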



E. Computing Sparse Codes for Test Data

In this section, we describe a procedure to evaluate the
sparse code for a test sample using the kernel MLD. Using
the R descriptors $\{x_r\}_{r=1}^{R}$ extracted from a test sample, we
evaluate the ensemble kernel matrix $K_{xY} \in \mathbb{R}^{1 \times N}$ where

$$K_{xY}(j) = \mathcal{K}(x, y_j) = \sum_{r=1}^{R} \beta_r \mathcal{K}_r(x_r, y_{r,j}). \tag{36}$$

In order to obtain sparse codes for the test sample using the
kernel MLD, we compute a sparse coefficient for each level
using the dictionary atoms from that level. Similar to the
training procedure, we first compute the correlations between
the test sample and all dictionary elements in level 1 as

$$\alpha_1 = \Phi(x)^T \Phi(\Psi_1) = \Phi(x)^T \Phi(Y_1) H_1 D_1 = K_{xY} H_1 D_1.$$

Following this, we determine the 1 x K membership vector
$z_1 = g(\alpha_1)$ and the coefficient vector $h_1 = z_1 \odot \alpha_1$. The
residual vector of the test sample can be computed as

$$\Phi(r_1) = \Phi(x) - \Phi(\Psi_1) h_1^T = \Phi(x) - \Phi(Y_1) H_1 D_1 h_1^T.$$

To determine a 1-sparse code in level 2, the residual vector
$r_1$ needs to be correlated with the dictionary atoms $\Phi(\Psi_2)$.
Generalizing this to any level s, we can evaluate the correlations
between the residual $\Phi(r_{s-1})$ and the dictionary atoms
$\Phi(\Psi_s)$ as

$$\alpha_s = M_s H_s D_s, \tag{37}$$

where $M_s$ is given by

$$M_s = \left[K_{xY} - \sum_{p=1}^{s-1} h_p D_p H_p^T \left(\prod_{t=1}^{p-1}(I - H_t D_t H_t^T)\right)^T K_{YY}\right] \prod_{t=1}^{s-1}(I - H_t D_t H_t^T). \tag{38}$$

The coefficient vector corresponding to level s can then be
obtained as $h_s = z_s \odot \alpha_s$, where $z_s = g(\alpha_s)$. The overall
sparse code for the test sample x can be constructed by stacking
coefficient vectors from all levels, $h = [h_1^T \; h_2^T \; \ldots \; h_S^T]^T$.
To compute the sparse code for a test sample x, Method 1
works with the ensemble kernel similarities $K_{\Psi x} \in \mathbb{R}^{K}$ and
since K < N, it is computationally less intensive compared
to Method 2, which uses the kernel similarities $K_{xY} \in \mathbb{R}^{N}$.
However, Method 2 has the important advantage that it does
not place any restriction on the choice of the kernel function
or the distance function for building the kernel matrix.

VI. Object Recognition and Unsupervised
Clustering

In this section, we describe the set of image descriptors and
the kernel functions considered for our simulations and present
discussions on the recognition performance using Caltech-101
and Caltech-256 datasets. The same set of features was used
to compute sparse coding-based graphs for clustering. In all
cases, we constructed the kernel matrices using (12), with
appropriately chosen distance functions between descriptors.



A. Image Descriptors and Kernels 

SIFT-ScSPM: For a given image, we extracted SIFT features
[46] with three scales on a dense grid and used a K-means
dictionary to obtain sparse codes for the local features. The
number of dictionary elements was fixed at 1024. For each
image, we generated the ScSPM feature using the algorithm
in [17] and aggregated sparse codes using max-pooling at
spatial scales 1, 2 and 4 respectively. The kernel matrix was
constructed based on Euclidean distance between the features.

SS-ScSPM: The base kernel was constructed in the same way
as the SIFT-ScSPM, except that the local SIFT descriptor was
replaced by the self-similarity descriptor [47]. The size of the
patch and the radius of the window were fixed at 5 x 5 and
40 respectively.

LBP-ScSPM: Another image descriptor was constructed using
the ScSPM procedure, for Local Binary Pattern (LBP) [48]
features extracted from overlapping patches in an image.

Gist: The images were resized to 128 x 128 and the gist
descriptor was extracted from each image [49]. The kernel
matrix was constructed based on Euclidean distance between
the features.

PHOG: We extracted the PHOG descriptor from each image
using the procedure described in [50], by fixing the number of
pyramid levels at 2. We used the Euclidean distance, and the
$\chi^2$ distance between the features for Method 1 and Method 2
respectively.

Biologically inspired features (C2-SWP, C2-ML): Biologically
inspired C2 features proposed in [51] and [52] aim to
mimic the simple and complex features in the human visual
system. We extracted C2-SWP and C2-ML features, and used
the Euclidean distance for both cases.

Geometric blur: For each image, we randomly sampled 400
edge pixels and applied the geometric blur descriptor [53]
to them. We used the distance function proposed in [41]
with these descriptors. Note that, because of the form of the
distance function, this feature cannot be used in a fixed point
dictionary update scheme. Therefore, when using this feature
with Method 1, we use the initial dictionary obtained in the
input space and refrain from updating it in the kernel space.

TABLE III

Comparison of the classification accuracies on the
Caltech-101 dataset.

Method                  # Training samples per class
                        5       10      15      20      25      30
Zhang et al. [41]       46.6    55.8    59.1    62      -       66.2
Lazebnik et al. [16]    -       -       56.4    -       -       64.6
Griffin et al. [54]     44.2    54.5    59      63.3    65.8    67.6
Jain et al. [55]        -       -       65      -       -       70.4
Boiman et al. [56]      -       -       61      -       -       69.1
Pham et al. [57]        -       -       42      -       -       -
Gemert et al. [58]      -       -       -       -       -       64.16
Yang et al. [17]        -       -       67      -       -       73.2
Wang et al. [19]        51.15   59.77   65.43   67.74   70.16   73.44
Aharon et al. [2]       49.8    59.8    65.2    68.7    71      73.2
Zhang et al. [22]       49.6    59.5    65.1    68.6    71.1    73
Jiang et al. [18]       54      63.1    67.7    70.5    72.3    73.6
MKSR (Method 1)         58.34   66.81   70.83   74.02   76.1    77.8
MKSR (Method 2)         58.9    67.3    71.44   74.7    76.83   78.01

In order to initialize the algorithms for obtaining MKSRs,
the parameters in the descriptors and the distance functions
were tuned independently. The criterion for tuning was that
the resulting sparse codes with the base kernels individually
achieved their best performances in classification using a linear
SVM. When optimizing dictionaries and computing MKSRs,
the ensemble kernel weights $\{\beta_r\}_{r=1}^{R}$ were empirically tuned,
again to ensure high classification accuracies with a test set. To
achieve this, we repeated the algorithms by randomly splitting
the data samples into train and test sets, and determined
the weights using cross-validation. We used the MATLAB
interface of LIBLINEAR, a fast implementation of linear SVM,
for all our simulations [59]. The performance metric used is
the percentage classification accuracy.

Fig. 3. Classification performance obtained by using each base kernel on the
Caltech-101 dataset, in comparison to multiple kernel sparse representations.
[Plot not reproduced. Legend: SIFT-ScSPM, SS-ScSPM, LBP-ScSPM, PHOG, Gist,
C2-SWP, C2-ML, Geometric Blur, MKSR (Method 1). Horizontal axis: number of
training images.]

In addition to object recognition, spectral clustering can
be performed using the graph created from the kernel sparse
codes. Using the coefficient matrix B, we constructed the
non-negative graph weight matrix $W = |B^T B|$. The normalized
weight matrix is given as $L = D^{-1/2} W D^{-1/2}$, where D is
a diagonal matrix with the i-th diagonal element containing
the sum of all the elements in the i-th row or column
of W. Assuming that there are C clusters, the eigenvectors
corresponding to the C largest eigenvalues are stacked in the
matrix $U = [u_1 \; u_2 \; \ldots \; u_C]$. The rows of U are then
clustered to obtain the cluster labels. The standard metrics,
clustering accuracy and normalized mutual information (NMI),
are used to quantify the clustering performance [29].
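The graph construction and spectral embedding described above can be sketched with the standard library alone. This is a hedged illustration: the name `spectral_embedding` is hypothetical, the sparse codes are assumed to be stored as the columns of B, and orthogonal (subspace) iteration on L + I is a swapped-in stand-in for whatever eigensolver was actually used. The shift by the identity makes all eigenvalues non-negative, so the dominant subspace corresponds to the C largest eigenvalues of L. The final step of clustering the rows of U is left to any standard clustering routine.

```python
import math
import random

def spectral_embedding(B, n_clusters, n_iter=200, seed=0):
    """Return U (one row per sample) from sparse codes stored as columns of B."""
    codes = list(zip(*B))
    n = len(codes)
    # W = |B^T B|: absolute inner products between pairs of sparse codes.
    W = [[abs(sum(a * b for a, b in zip(ci, cj))) for cj in codes] for ci in codes]
    # D holds the row sums of W; L = D^(-1/2) W D^(-1/2).
    deg = [sum(row) for row in W]
    s = [1.0 / math.sqrt(d) if d > 0 else 0.0 for d in deg]
    L = [[s[i] * W[i][j] * s[j] for j in range(n)] for i in range(n)]
    # Orthogonal iteration on L + I, starting from a random basis.
    rng = random.Random(seed)
    basis = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n_clusters)]
    for _ in range(n_iter):
        basis = [[v[i] + sum(L[i][j] * v[j] for j in range(n)) for i in range(n)]
                 for v in basis]
        ortho = []
        for v in basis:                      # Gram-Schmidt orthonormalization
            for q in ortho:
                dot = sum(a * b for a, b in zip(v, q))
                v = [a - dot * b for a, b in zip(v, q)]
            norm = math.sqrt(sum(a * a for a in v))
            ortho.append([a / norm for a in v])
        basis = ortho
    return [list(row) for row in zip(*basis)]
```

For codes whose supports do not overlap across groups, rows of U from the same group point in the same direction and rows from different groups are orthogonal, which is what makes the subsequent clustering of the rows straightforward.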

B. Simulation Results 

1) Caltech-101: The Caltech-101 dataset [60] consists of
9144 images belonging to 101 object categories and an addi- 
tional class of background images. The number of images in 
each category varies roughly between 40 and 800. We resized 
all images to be no larger than 300 x 300 with the aspect ratio 
preserved. Following the common evaluation procedure, we 
trained the classifiers on 5, 10, 15, 20, 25 and 30 images per 
class and evaluated the performance by testing on the rest. 
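The evaluation protocol above amounts to a per-class random split. A minimal sketch follows; the function name `split_per_class` and the use of integer sample indices are assumptions made for illustration, and the code does not reproduce the authors' actual scripts.

```python
import random

def split_per_class(labels, n_train, seed=0):
    """Pick n_train random samples per class for training; test on the rest."""
    rng = random.Random(seed)
    by_class = {}
    for idx, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(idx)
    train, test = [], []
    for lab in sorted(by_class):
        idxs = by_class[lab][:]
        rng.shuffle(idxs)
        train.extend(idxs[:n_train])
        test.extend(idxs[n_train:])
    return sorted(train), sorted(test)
```

Repeating the split with different seeds and averaging the resulting accuracies gives the mean recognition rate over random train and test sets.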

For the proposed Method 1, we obtained separate K-Means
dictionaries (2048 atoms) for each descriptor set in the original
input space. Following this, we computed the ensemble kernel
matrices using the expressions in (13), (14) and (15) respectively.
The sparsity penalty $\lambda$ for computing the kernel sparse
codes was fixed at 0.3. We also performed kernel dictionary
learning using the proposed Method 2 described in Section
V-D, fixing the number of levels at 16 and the number of atoms
per level at 128, resulting in a total of 2048 atoms.

The quantitative results of the proposed multiple kernel
sparse representation frameworks are presented in Table III.
The recognition rates presented are averaged over 10 iterations
with train and test datasets chosen at random. As can be
observed, the proposed approaches achieve higher classification
rates in comparison to other sparse coding based approaches.
In order to demonstrate the importance of fusing the multiple
kernels, we performed object recognition by considering each
base kernel separately. In each case, we tuned the kernel
parameter to yield the best recognition performance with
sparse codes obtained using KMLD. Figure 3 illustrates the
classification accuracies obtained, with each base kernel, for
different numbers of training images per class. For comparison,
we show the accuracies obtained for the two proposed multiple
kernel methods as well. The improvement in recognition by
using multiple kernels is apparent in all cases. For example,
when $N_{train} = 15$ the SIFT-ScSPM and the geometric
blur descriptors provide the best accuracies of 61.65% and
59.29% respectively, while the C2-SWP descriptor achieves
a very low recognition rate of 25.8%. However, when all
the kernels are combined using the proposed methods we
achieve improvements of 9.18 and 9.79 percentage points in the
mean recognition performance (70.83% for Method 1 and 71.44%
for Method 2 in Table III), when compared to using just
SIFT-ScSPM descriptors.

Finally, we demonstrate the effect of dictionary size on the
classification performance for both the proposed methods. We
varied the number of dictionary elements between 256 and
4096 and repeated the simulations with 30 training samples per
class, using 10 different random train and test sets in each case.
Figure 4 plots the mean percentage classification accuracy for
different dictionary sizes. We observed that beyond K = 2048,
the classification rate does not improve significantly with the
size of the dictionary.

Fig. 4. Classification accuracies of the proposed MKSR algorithms on the
Caltech-101 dataset using dictionaries of different sizes.
[Plot not reproduced. Legend: MKSR (Method 1), MKSR (Method 2). Horizontal
axis: number of dictionary elements.]

We evaluated the clustering performance of the kernel sparse
coding-based graphs using a benchmark subset of the Caltech-101
dataset (20 classes) [32]. Similar to the object recognition
simulations, we constructed graphs using kernel sparse codes
of the individual features for comparison. The dictionary size
was fixed at 2048 for all cases. Table IV shows the clustering
accuracy and normalized mutual information obtained using
the different graphs. In each case, both the average and
maximum values obtained over 50 trials are reported. As can
be observed, the ensemble features perform better than the
individual features: the average accuracy is 69.8% and 70.4%
for the two proposed methods, against 60.6% for the best
individual feature (SIFT-ScSPM).

TABLE IV

Comparison of the clustering performance obtained using
graphs, constructed from the kernel sparse codes, on a subset
of Caltech-101.

Feature              % Accuracy          NMI
                     Ave.     Max.       Ave.    Max.
SIFT-ScSPM           60.6     61.3       0.63    0.66
SS-ScSPM             52.8     54.31      0.52    0.59
PHOG                 45.4     46.1       0.47    0.51
Gist                 44.6     46.9       0.41    0.48
C2-SWP               38.31    39.6       0.33    0.36
C2-ML                31.8     33.1       0.29    0.35
GB                   49.6     50.2       0.5     0.53
MKSR (Method 1)      69.8     74.7       0.74    0.81
MKSR (Method 2)      70.4     75.13      0.79    0.83

2) Caltech-256: The Caltech-256 dataset [54] contains
30,607 images in 256 categories and its variability makes
it extremely challenging in comparison to the Caltech-101
dataset. The experimental setup is similar to the previous
section, and we evaluated the recognition performance with the
number of training images fixed at 15, 30, 45, and 60 images
per class respectively. The number of dictionary elements
was fixed at 4096 for both the algorithms. Table V shows
the recognition rates, and similar to the previous case, the
proposed algorithms outperform the other baseline methods.

TABLE V

Comparison of the classification accuracies on the
Caltech-256 dataset.

Method                 # Training samples per class
                       15       30       45       60
Gemert et al. [58]     -        27.17    -        -
Griffin et al. [54]    28.3     34.1     -        -
Yang et al. [17]       27.73    34.02    37.46    40.14
Guo et al. [37]        29.77    35.67    38.61    40.3
Wang et al. [19]       34.46    41.19    45.31    47.68
MKSR (Method 1)        38.72    45.12    48.65    50.27
MKSR (Method 2)        39.22    45.61    49.02    50.51

VII. Conclusions 

In this paper, we introduced the paradigm of multiple 
kernel sparse models that provide the ability to perform sparse 
learning in a unified space obtained using multiple kernels. 
The proposed approaches are of importance in complex vision 
problems, wherein they allow us to effectively characterize 
the image data, by fusing a large set of visual features. By 
learning sparse models in the multiple kernel domain, we 
generated compact representations for the images and achieved 
discrimination in the codes by exploiting the interactions 
between the multiple descriptors. Two different approaches 
for obtaining MKSR and optimizing the dictionaries for the 
feature space were developed. The proposed algorithms were 
comprehensively evaluated in supervised object recognition 
and unsupervised clustering tasks and the simulation results 
demonstrate the effectiveness of the proposed MKSR algorithms.
The proposed models can be further extended, with
appropriate constraints, to employ the resulting sparse codes
in other frameworks such as semi-supervised learning.